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graduate course. None of the results are new. They are presented here because they do 
not appear to be elsewhere available in compact form. 



Abstract 

The mathematics of linear fits is presented in covariant form. Topics include: corre- 
lated data, covariance matrices, joint fits to multiple data sets, constraints, and extension 
'{T)'- of the formalism to non-linear fits. A brief summary at the end provides a convenient 
crib sheet. These are somewhat amplified notes from a 90 minute lecture in a first- year 
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Expectations and Covariances 

y— i ■ Let y be random variable, which is drawn from a random distribution g(y). Then, we 

^ . define the "expected value" or "mean" of y as 

i> : 

o ; (y) = J yg(y) d y / J a(y)dy. 

m : 

Using this definition, it is straightforward to prove the following identities, 
9< (Vi +Ifc> = <Vi> + <!&>, (ky) = k(y), (k) = k, ((y)) = (y), 

o : 

+3 • where k is a constant. We now motivate the idea of a "covariance" of two random variables 

oj '. by noting 

>! (s/il/2) = (s/i)(s/2> + cav(yi,2/a) 

•i-H . 

^ ' where 

cov(yi,y 2 ) = ((yi - (j/i))(j/2 - (2/2))) = (2/12/2) - <2/i><2/2>, 

the last step following from the identities above. If, when y± is above its mean, then y2 also 
tends to be above its mean (and similarly for below), then cov(yi, 2/2) > 0, and then yi is 
said to be "correlated" with y 2 . If, when yi is above then y 2 is below, then cov(yi, y 2 ) < 0, 
and the y\ and y 2 are said to be "anti-correlated" . If cov(yi, y 2 ) = 0, they are said to be 
"uncorrelated" . Only in this case is it true that (2/12/2) = (2/1) (2/2)- Three other identities 
that are easily proven, 

cov(ky u y 2 ) = k-cav(y 1 ,y 2 ), cov(y u y 2 + y 3 ) = cov(y u y 2 ) + cov(y u y 3 ), cov(y,k) = 0. 

The covariance of a random variable with itself is called its variance. The error, a, is 
defined to be the square root of the variance, 



var(y) = cov(y, y) = (y 2 ) - (y) 2 , a(y) = A/var(y) 
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Definition of x 2 



Suppose I have a set of N data points y k with associated (uncorrelated for now) errors 
<Tfc, and I have a model that makes predictions for the values of these data points yk,mod- 
Then x 2 is defined to be 

N i ^2 
2 _ \Vk — yk,mod) 

k=l ° k 

If the errors are Gaussian, then the likelihood is given by C = exp(— x 2 /2), so that min- 
imizing x 2 is equivalent to maximizing C. However, even if the errors are not Gaussian, 
X 2 minimization is a well-defined procedure, and none of the results given below (with 
one explicitedly noted exception) depend in any way on the errors being Gaussian. This 
is important because sometimes little is known about the error distributions beyond their 
variances. 

More generally, the errors might be correlated. Although in practice this is the excep- 
tion, it makes the math much easier to consider the more general case of correlated errors. 
In this case, the covariance matrix, Cki, of the correlated errors (and its inverse Bki) are 
defined by 

C k i = cov(y k ,yi), B = C -1 , 
Both C and B are symmetric. Then % 2 is written as 

N N 



X 2 = X Y(Vk ~ V k, mod) B M (y i - yz, mo d) 



k=i i=i 



Note that for the special case of uncorrelated errors, C^i = bki&\-, where the Kronecker 
delta is defined by 

S kt = l (k = l), 5 kl = ik^l). 
In this case, B k i = Skicr^ 2 , so 



N N c N 



2 V^V^/ \&kl t \ (Vk - yk,mod)(yk - 2/fc,mod) 

x =2^z^(yk-yk,mod) — {yi-yi,mod) = 2_^ 2 

k=l 1=1 ° k k=l ° k 

which is the original definition. 



Linear model 



A linear model of n parameters is given by 

n 

2/mod = J~]ajfi(x), 
i=l 

where the fi(x) are n arbitrary functions of the independent variable x. The independent 
variable is something that is known exactly (or very precisely), such as time. If the 
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independent variable is subject to significant uncertainty, the approach presented here 
must be substantially modified. 
We can now write x 2 , 



N N 



k=i 1=1 



i=l 



B. 



ki 



3=1 



The proliferation of summation signs is getting annoying. We can get rid of all of them 
using the "Einstein summation convention" , whereby we just agree to sum over repeated 
indices. In the above cases, these are k, We then write, 

X 2 = [Vk ~ a,ifi{x k )]B k i[yi ~ ajfjixi)], 

which is a lot simpler. Note that k and / are summed over the N data points, while i and 
j are summed over the n parameters. 

Minimizing % 2 

The general problem of minimizing % 2 with respect to the parameters ai can be 
difficult, but for linear models it is straightforward. We first rewrite, 

X 2 = VkBuVi - 2Mi + dibijdj, 

where the di and bij are defined by 

di = ykBkifi(xi), = fi(x k )B k ifj(xi). 

We find the minimum by setting all the derivatives of x 2 with respect to the parameters 
equal to zero, 



= 



dx' 
da„ 



^fiimdi ~\~ Oi m bij CLj ~\~ <XibijSj 7 



-2d m + bmjdj + Clibi m 



^{bmjttj <i m ), 



where I have used dai/da m = 5i m and where I summed over one dummy index (i or j) in 
each step. This equation is easily solved: 



dm — b m j(lj =r* Cli — Cijdj, 



where c^- is defined as the inverse of b^ 



c=b-\ 



Note that the di are random variables because they are linear combinations of other random 
variables (the yk), but that the b^ (and so the c^) are combinations of constants, and so 
are not random variables. 
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Covariances of the Parameters 



We would like to evaluate the covariances of the di, i.e., cov(ai, dj), and derive their 
errors a(di) = \fcov(a,i, dj). The first step to doing so is to evaluate the covariances of the 

di, 

cov(di,dj) = co\r[y k B k ifi(xi),y p B pq fj(x q )] = B k ifi(xi)B pq fj(x q )cov(y k ,y p ). 
Then, using the definition of C kp and successively summing over all repeated indices, 

cov(di,dj) = B k ifi(xi)B pq fj(x q )Ckp = Bkifi(xi)8 kq fj(x q ) = fi(xi)B k ifj(x k ) = 6 y . 

Next, we evaluate the covariance of di with dj, 

cov(a;, dj) = cav(c im d m , dj) = c im cov(d m , dj) = c im h m j = 

Finally, 

cov(a;, a,j) = cov(c im d m , dj) = c im cov(d m , a,) = c im S m j = Cij. 

That is, the c^, which were introduced only to solve for the di, now actually turn out to 
be the covariances of the a^. In particular, 

o-(a,i) = y/c^. 

This is a very powerful result. It means that one can figure out the errors in an experiment, 
without having any data (just so long as one knows what data one is planning to get, and 
what the measurement errors will be). 



A simple example 



Let us consider the simple example of a two-parameter straight-line fit to some data. 
In this case, 

2/mod = + a 2 f 2 (x) =d 1 + a 2 x, 

i.e., f\(x) = 1 and f 2 (x) = x. Let us also assume that the measurement errors are 
uncorr elated. Hence 

N N 
k=i a k fc=i a k 



N - N N 



6ll = ^-2, ^12=^21 = ^-2, 6 22 = ^-|- 

k=l k k=l k k=l k 

Let us now further specialize to the case for which all the o k are equal. Then 

o*Ux) <z 2 > 



where the "expectation" signs now mean simply averaging over the distribution of the 
independent variable at the times when the observations actually take place. This matrix 
is easily inverted, 

cr 2 / (x 2 ) -(x) ' 



N((x 2 ) - (x) 2 ) 
In particular, the error in the slope is given by 

[cr {a2)\ = var(a 2 ) = c 2 2 = 



Nvav(x) N (Ax) 2 ' 

where I have used var(;r) as a shorthand for (x 2 ) — (x) 2 , and where in the last step I 
have evaluated this quantity for the special case of data uniformly distributed over an 
interval Ax. That is, for data with equal errors, the variance of the slope is equal to 
the variance of the individual measurements divided by the variance of the independent- 
variable distribution, and then divided by N. 

Nonlinear fits 

Consider a more general model, 

l/mod = F(x;ai ...a n ), 

in which the function F is not linear in the aj, e.g., F(x;ai,a2) = cos(aia;) exp(— a.2x). 
The method given above cannot be used directly to solve for the c^. Nevertheless, once 
the minimum a° is found, by whatever method, one can write 

F(x; a 1 ...a n ) = Fo + (a t - a°)/i(x) + ... 

where 

dF(x;ai ...a n ) 
fi{x) = — 



dai 

Then, one can calculate the bij, and so the c^, and thus obtain the errors. In fact, this 
formulation can also be used to find the minimum (using Newton's method), but the 
description of this approach goes beyond the scope of these notes. 

Expected Value of \ 2 

We evaluate the expected value of x 2 after it has been minimized 

(X 2 ) = (VkBkiyi) ~ 2(aidi) + (oibijdj) = (y k B k iyi) - (a^i), 

where I have used the minimizing condition di = bijcij. To put this expression in an 
equivalent form whose physical meaning is more apparent, it is better to retreat to the 
original definition, 

(X 2 ) = ([Vk ~ Oifi{x k )]B k i[yi -ajfj{xi)\) 
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or 

(X 2 ) = cov([y k - aifi(x k )},Bki[yi - ajfj(xi)}) + ([y k - aifi(x k )})Bki{[yi - ajfj(xi)}). 
The first term can be evaluated, 

cov([y k - aifi(x k )}, B k i[yt -ajfj(xi)]) = B k iCOv(y k ,yi) - 2cov(a i X) + 6 ij -cov(a i , a,) 

= B k iC k i - 25a + bijCij = 5 kk - 5a, 

using cov(ai, dj) = Sij. Since 5 kk = N and 8u = n (note that repeated indices still indicate 
summation), we have 

(X 2 ) = N - n + (y k - aifi(x k ))B k i(yi - ajfj(xi)). 

The last term is composed of the product of the expected difference between the model 
and the data at all pairs of points. For uncorrelated errors, 

(yk ~ a,ifi(xk))Bki{yi - a>jfj(xi) -> V — — a% ^ Xk ^ . 

-— * (J . 



k=i 



k 



If the model space spans the physical situation that is being modeled, then this expected 
difference is exactly zero. In this case, (x 2 ) = iV — n, the number of data points less the 
number of parameters. If this relation fails, it can only be for one of three reasons: 1) 
normal statistical fluctuations, 2) misestimation of the errors, 3) problems in the model. 
For Gaussian statistics, one can show that 

var(x 2 ) = 2(N -n). 

So if there are, say 58 data points, and 8 parameters, then x 2 should be 50 ± 10. So if it is 
57 or 41, that is ok. If it is 25 or 108, that is not ok. If it is 25, the only effect that could 
have produced this would be that the errors were overestimated. If it is 108, there are two 
possible causes. The errors could have been underestimated or the model could be failing 
to represent the physical situation. For example, if the physical situation causes the data 
to trace a parabola (which requires 3 parameters), but the model has only two parameters 
(say a 2-parameter straight line fit), then the model cannot adequately represent the data 
and the third term will be non-zero and so cause x 2 to go up. 

Combining Covariance Matrices 

Suppose one has two sets of measurements, which had been analyzed as separate x 2 ' s 5 
Xi and xl- Minimizing each with respect to the a» yields best-fits a\ and a 2 , and associated 
vectors and matrices dj, bjj, cjj, <i 2 , 6 2 J 7 and c 2 j. Now imagine minimizing the sum of these 
two x 2 's with respect to the a m , 
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which yields a combined solution, 



a i = c ij d j , di = d\ + d%, &y = &y + &y, c^fr 1 . 

But it is not actually necessary to go back to the original x 2, s. If you were given the results 
of the previous analyses, i.e. the best fit a\ and c*- and aj and c 2 - for the two separate fits, 
you could calculate b 1 = (c 1 ) -1 and d\ = bjjCi^ and similarly for "2", and then directly 
calculate the combined best fit. 

Linear Constraints 

Suppose that you have found a best fit to your data, but now you have obtained 
some additional information that fixes a linear relation among your parameters. Such a 
constraint can be written 

Ki^Cli — — Z. 

For example, suppose that your information was that a\ was actually equal to 3. Then 
k = (1,0,0,...) and z = 3. Or suppose that the information was that a\ and were 
equal. Then k = (1, — 1, 0, 0, . . .) and z = 0. How is the best fit solution and covariance 
matrix affected by this constraint? The answer is 

n l K 3 a °i ~ z \ 
5j = a\ — Dai D = , = CijKj. 

OLp 

I derive this in a more general context below. For now just note that by 

l 7-1 7-i\ K-iKj , q q, KiKj I 

cov(D 7 D) = ^cov(a",a") = ~T^ c io 



^ QLp Kp ) y CXp Kp j QLp Kp 

and 

cov(a^D) = — — cov(a^,a") = 



Qtpfcp cXphvp 
we obtain, 



dij = cov(ai, dj) = cov(a^, a°) — 2cov(a°, D)ctj + cov(.D, D)otiOtj = Cij * 



OLp Kp 



How does imposing this constraint affect the expected value of % 2 ? Substituting aj — > a i: 
we get all the original terms plus a A(% 2 ), 

A(x 2 ) = 2ai(Ddi) - 2a i b ij (a j D) + otiOtjbi^D 2 ) = a i c pj Kpb ij (cov(D, D) + (L>) 2 ), 



or 



A( X 2 ) = 1 + (D) 2 a p K p . 



That is, if the constraint really reflects reality, i.e. (D) = 0, then imposing the constraint 
increases the expected value of x 2 by exactly unity. However, if the constraint is untrue, 
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then imposing it will cause x 2 to go up by more. Hence, if the original model space 
represents reality and the constrained space continues to do so, then 

(x 2 ) = N — n + m, 

where N is the number of data points, n is the number of parameters, and m is the number 
of constraints. 

General Constraint and Proof 

As a practical matter, if one has m > 1 constraints, 

K^di = z k , (k = 1 . . .m), 

these can be imposed sequentially in a computer program using the above formalism. 
However, suppose in a fit of mathematical purity, you decided you wanted to impose them 
all at once. Then, following Gould & An (2002 ApJ 565 1381), but with slight notation 
changes, you should first rewrite, 

X 2 (a l ) = (ai - a^bijfaj - a°) + xl, 

where a? is the unconstrained solution and Xo * s ^ ne va hie of x 2 f° r that solution. At the 
constrained solution, the gradient of x 2 must he in the m dimensional subspace defined 
by the m constraint vectors k^. Otherwise it would be possible to reduce x 2 further while 
still obeying the constraints. Mathematically, 

bijiaj -a°) +D k K\ =0, 

where the D k are m so far undetermined coefficients. While we do not yet know these 
parameters, we can already solve this equation for the hi in terms of them, i.e., 

at = a° — D l a\, a\ = CijK^. 

Multiplying this equation by the K k yields m equations, 

k% = K k a° t - C kl D\ C kl = n k a\ = /cjcy/cj. 

Then applying the m constraints gives, 

C kl D l = A k , A k = n k a° l -z k , 

so that the solution for the D k is 

D k = B kl A l , B = C~\ 
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where now the implied summation is over the m constraints I = 1, . . . , m. Note that, 

cov{A k , A 1 ) = KiKjCovia®, a°j) = k\k\c^ = C kl , 

cov(A fc , D l ) = B lq cov(A h , A q ) = B lq C kq = 8 kh 
cov(D k , D l ) = B kq cov(A q , D l ) = B kq S q i = B kl , 

and that, 

cov(a°, D k ) = S M 4cov(a°, a°) = B kl a\. 

Hence, 

cov(a;, cij) = - 2a k cov(D k , a°) + a k a l jCOv{D k , D l ) = - 2a k B kl a l j + a k a l j B kl . 
That is, 

Cij CX-i B ^j* 

Finally, we can evaluate the expected value of x 2 directly taking into account the m 
constraints, 

(X 2 ) = B kl C kl -covfadi) + cov(A q ,D q ) + B kl (y k ){ yi ) - (a,)(^> + (A q )(D q ). 
That is, 

(X 2 ) = N-n + m+(y k - y k , mod )B kl ( yi - y ; , mod > + (A*)B™(A«). 

Summary 

The di (products of the data y k with the trial functions fi(xi)) and the 6^ (products 
of the trial functions with each other), 

di = y k B k ifi(xi), bij = fi{x k )B k ifj(xi), (B = C~\ C ki = (y k yi) - (y k ){yi)), 

are conjugate to the fit parameters ai and their associated covariance matrix c^, 

cov(di, dj) = bij, cov(ai, a,j) = Cy , di = bijdj, di = Cijdj, c = ft -1 . 

The parameter errors and covariances Cy can be determined just from the trial functions, 
without knowing the data values y k . 

Similarly the A p (products of the unconstrained parameters a? with the constraints 
k p ) and the C pq (products of the constraints with each other), 

A P = K^Cli , (JP1 __ i^djKj, 
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are conjugate to coefficients of the constrained-parameter adjustments D p and their asso- 
ciated covariance matrix B pq , 

cov(A p , A q ) = C p \ cov(D p , D q ) = B pq , A p = C pq D q , D p = B pq A q , B = C _1 . 

And, while the constrained parameters di of course require knowledge of the data, the 
constrained covariance matrix does not, 

d % = a° - D p a p , ~c %3 = c %3 - a p B pq a]. 

Note that the vector adjustments a P have the same relation to the constraints k p that the 
cii have to the di, 

a p = ClJ K p , K p = b i3 a P 
Finally, the expected value of x 2 is 

(X 2 ) = N-n + m+(y k - y k , mod )B kl (yi - yi,mod) + (A p )B pq (A q ). 

That is, the number of data points, less the number of parameters, plus the number of 
constraints, plus two possible additional terms. The first is zero if the model space spans 
the system being measured mo d) = {Vk))-> but otherwise is strictly positive. The second 
is zero if the constraints are valid ((A p ) = 0), but otherwise is strictly positive. 
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